The IIT Bombay Hindi-English Translation System at WMT 2014

Authors

  • Piyush Dungarwal
  • Rajen Chatterjee
  • Abhijit Mishra
  • Anoop Kunchukuttan
  • Ritesh M. Shah
  • Pushpak Bhattacharyya
Abstract

In this paper, we describe our English-Hindi and Hindi-English statistical systems submitted to the WMT14 shared task. The core components of our translation systems are phrase-based (Hindi-English) and factored (English-Hindi) SMT systems. We show that using number, case and Tree Adjoining Grammar information as factors helps to improve English-Hindi translation, primarily by generating morphological inflections correctly. We also show improvements to the translation systems from pre-processing and post-processing components. To overcome the structural divergence between English and Hindi, we preorder the source-side sentence to conform to the target-language word order. Because the parallel corpus is limited, many words are not translated. We translate out-of-vocabulary words and transliterate named entities in a post-processing stage. We also investigate ranking translations from multiple systems to select the best translation.

Similar resources

The IIT Bombay English-Hindi Parallel Corpus

The IIT Bombay English-Hindi corpus contains a parallel corpus for English-Hindi compiled from a variety of existing sources, as well as corpora developed at the Center for Indian Language Technology, IIT Bombay over the years. The training corpus consists of sentences, phrases and dictionary entries, spanning many applications and domains. The details of the training corpus are shown in T...

The IIT Bombay SMT System for ICON 2014 Tools Contest

In this paper, we describe our submission to the ICON 2014 Tools Contest for Machine Translation. The source languages are English, Marathi, Tamil, Telugu, Bengali and the target language is Hindi. We submitted 15 systems; 5 each for the tourism, health and general domains. Our submission is a Phrase-based Statistical Machine Translation system with preprocessing and post-processing elements. A...

Edinburgh's Syntax-Based Systems at WMT 2014

This paper describes the string-to-tree systems built at the University of Edinburgh for the WMT 2014 shared translation task. We developed systems for English-German, Czech-English, French-English, German-English, Hindi-English, and Russian-English. This year we improved our English-German system through target-side compound splitting, morphosyntactic constraints, and refinements to parse tree ...

Edinburgh's Phrase-based Machine Translation Systems for WMT-14

This paper describes the University of Edinburgh’s (UEDIN) phrase-based submissions to the translation and medical translation shared tasks of the 2014 Workshop on Statistical Machine Translation (WMT). We participated in all language pairs. We have improved upon our 2013 system by i) using generalized representations, specifically automatic word clusters for translations out of English, ii) us...

Hindi Word Sense Disambiguation

Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Mumbai, India ({manish, mahesh, pb, pandey, yupu}@cse.iitb.ac.in). Abstract: Word Sense Disambiguation (WSD) is defined as the task of finding the correct sense of a word in a specific context. This is crucial for applications like Machine Translation and Information Extraction. While the work on automatic WSD for En...

Publication date: 2014